Improving Synoptic Quering for Source Retrieval: Notebook for PAN at CLEF 2015

نویسندگان

  • Simon Suchomel
  • Michal Brandejs
چکیده

Source retrieval is a part of a plagiarism discovery process, where only a selected set of candidate documents is retrieved from a large corpus of potential source documents and passed for detailed document comparison in order to highlight potential plagiarism. This paper describes a used methodology and the architecture of a source retrieval system, developed for PAN 2015 lab on uncovering plagiarism, authorship and social software misuse. The system is based on our previous systems used at PAN since 2012. The paper also discusses the queries performance and provides explanation for many implementation settings. The proposed methodology achieved the highest recall with usage of the least number of queries among other PAN 2015 softwares during the official test run. The source retrieval subsystem forms an integral part of a modern system for plagiarism discovery.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Statistic and Semantic Analysis to Detect Plagiarism Notebook for PAN at CLEF 2013

This paper describes an approach submitted to the 2013 PAN competiton for the source retrieval sub-task. Three different methods for extracting queries were used, which employed tf-idf, noun phrases and named entities, in order to submit very different queries and maximize recall.

متن کامل

CopyCaptor : Plagiarized Source Retrieval System using Global Word Frequency and Local Feedback Notebook for PAN at CLEF 2013

In this paper, we present a plagiarized source retrieval system called CopyCaptor using global word frequency and local feedback to generate an effective query for finding plagiarized source documents from the given suspicious document on PAN’13 source retrieval task. The system achieved 3rd place in competition with 0.33 F1 score, 0.50 precision and 0.33 recall on the test which find appropria...

متن کامل

Heterogeneous Queries for Synoptic and Phrasal Search

This paper describes an architecture of the source retrieval system used at PAN 2014 lab on uncovering plagiarism, authorship, and social software misuse. The system is based on the systems used in past years at PAN 13 [6] and PAN 12 [5]. The majority of features were adapted with some improvements described in this paper. The source retrieval subsystem forms an integral part of a modern system...

متن کامل

Authorship Verification: An Approach based on Random Forest: Notebook for PAN at CLEF 2015

Authorship attribution, being an important problem in many areas including information retrieval, computational linguistics, law and journalism etc., has been identified as a subject of increasingly research interest in the recent years. In case of Author Identification task in PAN at CLEF 2015, the main focus was given on cross-genre and cross-topic author verification tasks. We have used seve...

متن کامل

Diverse Queries and Feature Type Selection for Plagiarism Discovery Notebook for PAN at CLEF 2013

This paper describes approaches used for the Plagiarism Detection task in PAN 2013 international competition on uncovering plagiarism, authorship, and social software misuse. We present modified three-way search methodology for Source Retrieval subtask and analyse snippet similarity performance. The results show, that presented approach is adaptable in real-world plagiarism situations. For the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015